Social networks such as Facebook are becoming more and more important for online news services: an increasing number of readers access news pages via links shared in these networks. Facebook users, for example, can use their profile to share links to external websites - such as news portals - with their online friends. Social media has thus developed into an important generator of traffic for news websites. In Germany, 94% of the news articles shared online in 2015 were distributed via Facebook, followed by Twitter with 3.5% and Google+ with 2.3%. The advertising-financed business model of the media houses depends on users visiting their websites, since that is what generates advertising revenue. News outlets are therefore particularly interested in which topics are more likely to be shared on these platforms. Studies show that social media users choose a certain site depending on the topic they are looking for: FOCUS Online, for example, is favoured for articles on politics and business, while sports news is more likely to be shared from Bild.de.
While these static sections give an indication of an article's content, multiple articles in the same section do not necessarily cover the same topics (and are not shared equally) - especially when the articles originate from different news portals. Furthermore, an article can contain more than one topic. We use a topic model to reveal the underlying topics of a collection of articles (a corpus), and how the articles exhibit them. We then estimate the effect of topic prevalence on the number of Facebook shares.
Mapping raw text to one or more topics, without prior knowledge of what those topics are, translates to an unsupervised classification problem on natural language. Among topic models, Latent Dirichlet Allocation (LDA) is a widely used technique, where each document (article) is viewed as a mixture of topics (represented by the document-topic distribution) and each topic is a mixture of unique terms (represented by the topic-term distribution).
This model views the text generation process as conforming to the following characteristics:

- Each topic is a distribution over the vocabulary, drawn from a Dirichlet prior.
- Each document has its own distribution over topics, also drawn from a Dirichlet prior.
- Every word in a document is generated by first drawing a topic from the document's topic distribution, and then drawing a term from that topic's term distribution.
To “learn” the topic prevalence and the topic-term distribution, collapsed Gibbs sampling can be used. One important consideration with this model is that the number of topics \(k\) must be specified a priori.
We load libraries needed for this analysis.
suppressPackageStartupMessages({
library(dplyr) # Data manipulation
library(stringr) # String manipulation
library(lubridate) # Date and time manipulation
library(purrr) # Functional programming
library(tidyr) # Reshaping
library(magrittr) # Advanced piping
library(pushoverr) # Pushover notifications
library(doMC) # Parallel Computing
library(readr) # Importing data
library(tibble) # Better data frames
library(ggplot2) # Static data visualization
library(ggrepel) # Repel text labels
library(ggiraph) # GGplot interactive
library(scales) # Scales
library(viridis) # Viridis color scales
library(htmlwidgets) # JS visualizations
library(htmltools) # Arbitrary html
library(ggjoy) # Create joyplots
library(gganimate) # Animating ggplots
library(tweenr) # Tweening charts
library(httr) # HTTP functions
library(jsonlite) # JSON parsing
library(tidytext) # Tidy text mining
library(tm) # Classic text mining
library(hunspell) # Text processing
library(stringdist) # String distances
library(topicmodels) # Topic modelling
library(proxy) # Distance measures
library(SnowballC) # Stemming
})
# Theming
quartzFonts(
Roboto =
c("Roboto-Light",
"Roboto-Bold",
"Roboto-Regular",
"Roboto-Thin")
)
theme_set(
theme_bw(base_family = "Roboto", base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 14,
margin = margin(0, 0, 4, 0, "pt")),
plot.subtitle = element_text(size = 12),
plot.caption = element_text(size = 6, hjust = 0),
axis.title = element_text(size = 10),
panel.border = element_blank()
)
)
rm(list=ls())
# Functions
source("func/02-topic-modelling-functions.R")
We tokenize the cleaned article texts by word.
btw %>%
# Tokenize by word
unnest_tokens(word, text_cleaned, token = "words") ->
# Assign to variable
posts_tokenized.dt
We remove words that are less than 3 characters long.
# Remove short words
posts_tokenized.dt %>%
filter(str_length(word) >= 3) ->
posts_tokenized.dt
We summarise the data into a more compact form of per-article term frequencies and convert it into a document-term matrix.
# Summarise
posts_tokenized.dt %>%
group_by(articleID, word) %>%
dplyr::summarise(term_frequency = n()) %>%
ungroup() ->
posts_tokenized.dt
We then train the LDA on the full dataset, with \(k = 40\).
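The chunk that casts the summarised counts into a document-term matrix and fits the model is not shown above; a minimal sketch, assuming the `posts_tokenized.dt` tibble from the previous step and using `cast_dtm()` (tidytext) and `LDA()` (topicmodels), both loaded earlier, could look like this (seed and iteration counts are illustrative assumptions, not the original settings):

```r
# Cast the per-article term frequencies into a document-term matrix
posts_tokenized.dt %>%
  cast_dtm(articleID, word, term_frequency) ->
  posts.dtm

# Fit the LDA via collapsed Gibbs sampling with k = 40 topics
# (seed, burnin and iter are assumed values for illustration)
ldaOut <- LDA(posts.dtm,
              k = 40,
              method = "Gibbs",
              control = list(seed = 1234, burnin = 1000, iter = 2000))
```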
Training took 50 minutes on the full dataset. We extract the word-topic probabilities, and the document-topic probabilities produced by the model.
# Word-topic probabilities
ldaOut %>% tidy("beta") -> posts.wtp
# Document-topic probabilities
ldaOut %>% tidy("gamma") -> posts.dtp
We can then assess the word-topic probabilities to get an idea of what each topic is about.
Since we have over 30,000 unique terms in the corpus, we need to extract the top few words that most uniquely define each topic, so that we can more easily visualize and label them. We use the measure of relevance defined by Sievert and Shirley (2014)1. Relevance of term \(w\) to topic \(k\) given a weight parameter \(\lambda\) between 0 and 1 \(r(w, k | \lambda)\) is computed as:
\[ r(w, k | \lambda) = \lambda\log(\phi_{kw}) + (1-\lambda)\log(\frac{\phi_{kw}}{p_w}) \]
where \(\phi_{kw}\) is the probability of term \(w\) for topic \(k\), and \(p_w\) is the empirical probability of the word in the corpus. \(\lambda\) can be thought of as a weighting term between ranking by the probability of that word within the topic, and ranking by the lift over the overall probability in the corpus.
In user studies, Sievert and Shirley (2014) found a lambda value of 0.6 to be optimal for allowing humans to identify the topics associated with the top words ranked by relevance. We use this same value for lambda.
posts.wtp %>%
# Compute lambda and phi_kw
mutate(lambda = 0.6, phi_kw = beta) %>%
# Compute and join the p_w
left_join(
posts_tokenized.dt %>%
group_by(word) %>%
dplyr::summarise(frequency = sum(term_frequency)) %>%
ungroup() %>%
mutate(p_w = frequency/sum(frequency)) %>%
select(-frequency),
by = c("term" = "word")
) %>%
# Compute the relevance
mutate(relevance = lambda * log(phi_kw) + (1 - lambda) * log(phi_kw/p_w)) ->
word_relevance.dt
We take a look at sample articles for each topic together with its most relevant words, and label each topic according to the prevalent theme. The labels are recorded in a csv file and loaded in:
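The csv and the loading chunk are not reproduced here; a minimal sketch with `read_csv()` (readr, loaded earlier), assuming a hypothetical file `topic-labels.csv` with columns `topic` and `topic_name`:

```r
# Load the manually assigned topic labels
# (file name and column names are assumptions for illustration)
topic_labels.dt <- read_csv("topic-labels.csv",
                            col_types = cols(topic = col_integer(),
                                             topic_name = col_character()))
```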
### Most likely Topics per Article
# The topics() function from topicmodels extracts the two most likely topics per document
btw.topics <- topics(ldaOut, 2)
# Create Dataframe
doctopics.df <- as.data.frame(t(btw.topics))
doctopics.df %>%
transmute(title = rownames(.),
topic = V1) -> doctopics.df
doctopics.df$articleID <- as.integer(rownames(doctopics.df))
# Add topic to original DF
topics_mapping.dt <- btw %>%
mutate(articleID = as.integer(articleID)) %>%
inner_join(.,doctopics.df, by="articleID") %>%
group_by(topic) %>%
slice(1) %>%
select(topic, title.text, articleID)
## Return the top 30 terms.
btw.terms <- as.data.frame(terms(ldaOut, 30), stringsAsFactors = FALSE)
topicTerms <- btw.terms %>% gather(Topic)
topicTerms <- cbind(topicTerms, Rank = rep(1:30))
topicTerms <- topicTerms %>% filter(Rank < 5)
topicTerms <- topicTerms %>% mutate(topic = stringr::word(Topic, 2))
topicTerms$topic <- as.numeric(topicTerms$topic)
topicLabel <- data.frame()
for (i in 1:40){
z <- dplyr::filter(topicTerms, topic==i)
l <- as.data.frame(paste(z[1,2], z[2,2], z[3,2], z[4,2], sep = " "), stringsAsFactors = FALSE)
topicLabel <- rbind(topicLabel, l)
}
colnames(topicLabel) <- c("topic_name")
topicLabel$topic <- as.integer(rownames(topicLabel))
## Combine with Topic label
topics_mapping.dt %>%
left_join(., topicLabel, by="topic") -> topics_mapping.dt
We extract the top 30 most relevant words per topic and plot them as follows:
(Figure: Facebook News Map)
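The plotting chunk behind this chart is not reproduced here; a minimal sketch of a faceted bar chart of the most relevant terms per topic, assuming the `word_relevance.dt` tibble computed above and using `reorder_within()` and `scale_x_reordered()` from tidytext:

```r
word_relevance.dt %>%
  # Keep the most relevant terms per topic
  group_by(topic) %>%
  top_n(30, relevance) %>%
  ungroup() %>%
  # Order terms by relevance within each facet
  ggplot(aes(x = reorder_within(term, relevance, topic), y = relevance)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ topic, scales = "free_y") +
  scale_x_reordered() +
  labs(title = "Facebook News Map", x = NULL, y = "Relevance")
```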
Now that we have labelled each topic, we produce sample documents classified to that topic for reference. We produce a random sample of 10 posts from the 300 highest probability fits per topic.
set.seed(9272)
posts.dtp %>%
group_by(title = document) %>%
summarise(topic = min(topic[gamma == max(gamma)]), gamma = max(gamma)) %>%
ungroup() %>%
inner_join(btw %>% select(title, articleID), by = "title") %>%
inner_join(topics_mapping.dt, by = "topic") %>%
mutate(topic_title =
paste0("Topic ", formatC(topic, flag = "0", width = 2),
" - ", topic_name)) %>%
group_by(topic_name) %>%
top_n(300, gamma) %>%
sample_n(10) %>%
mutate(row = row_number()) %>%
ungroup() ->
posts_classification.sdt
We produce a chart that shows this in a presentable manner.